NASA TECHNICAL TRANSLATION
NASA TT F-11,252
SPEECH COMMANDS IN CONTROL SYSTEMS
Translation of "Ustnyye Komandy v Sistemakh Upravleniya"
Izvestiya Akademii Nauk Estonskoy SSR, Seriya Fiziko-
Matematicheskikh i Tekhnicheskikh Nauk, Vol. 15, No. 3,
pp. 377-399, 1966
NATIONAL AERONAUTICS AND SPACE ADMINISTRATION
WASHINGTON, DC 20546 APRIL 1968
ABSTRACT. A review of the literature dealing with auto-
matic recognition of speech sounds. The problem of
increasing channel carrying capacity is considered. A
study is made of the mechanism of sound formation. The
principles of operation of band-pass, formant, scanning,
harmonic and correlation voice coders are outlined. A num-
ber of signal devices for recognizing speech signals are
described, and the use of universal computers as a means of
studying and recognizing speech signals is discussed.
1. Multiplexing of the Communications Channel
As is known, a speech signal consists of a sum of individual oscillations / 377
of various frequencies and amplitudes. When it is expanded into a series,
summation may be performed either with respect to elements equally spaced by
frequency (Fourier series) or with respect to elements equally spaced by time
(Kotel'nikov's theorem). In the first case, the speech signal is subjected to
harmonic analysis and the amplitudes (and phases) of the harmonics are trans-
mitted through the channel; at the receiving point, the speech signal is
restored using these spectral coefficients, determined by the analyzer at the
transmitting end of the communications channel. In the second case, pulses
are transmitted through the communications channel at discrete time intervals;
the amplitudes of the pulses are proportional to the instantaneous values of
the function, read at intervals Δt. At the receiving end, these pulses pass
through a filter, the output of which is constantly added.
The information transmitted contains, in addition to the useful informa-
tion, a certain quantity of noise. A criterion characterizing the signal level
in comparison to the noise level is the value H = log(P_c/P_n), where P_c and
P_n are the mean powers of signal and noise respectively. The product of three
quantities V = TFH is called the signal volume. A communications channel is
also characterized by three quantities: T_k, the time interval during which the
channel is connected; F_k, the band of frequencies transmitted through the
channel; and H_k, the power level of the apparatus making up the channel. The
product V_k = T_k F_k H_k is called the channel capacity. The condition V_k ≥ V
must be assured if the signal is to pass through the channel.
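The volume condition can be illustrated with a small numerical sketch; the function names and all figures below are illustrative, not taken from the source:

```python
import math

def signal_volume(T, F, P_signal, P_noise):
    """Signal volume V = T*F*H, with H = log10(P_signal/P_noise)."""
    H = math.log10(P_signal / P_noise)
    return T * F * H

def fits_channel(V, T_k, F_k, H_k):
    """Condition V_k = T_k*F_k*H_k >= V for the signal to pass."""
    return T_k * F_k * H_k >= V

# 60 s of speech in a 3000 Hz band at a 1000:1 power ratio (H = 3)
V = signal_volume(60.0, 3000.0, 1000.0, 1.0)
print(V)                                    # ≈ 540000.0
print(fits_channel(V, 60.0, 3000.0, 3.0))   # True: capacity matches volume
print(fits_channel(V, 60.0, 300.0, 3.0))    # False: band too narrow
```

The condition simply says that a channel cannot carry a signal whose combined time-band-level volume exceeds the channel capacity.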
1 Numbers in the margin indicate pagination in the foreign text.
Multiplexing of a communications channel can be achieved by deforming a
pair of these quantities while leaving the third quantity unchanged.
Deformation of the signal is performed by compressing it at the transmitting /378
end of the channel and expanding it at the receiving end. Changes in H, T and
F correspond to changes in amplification, delay of the signal using a delay
line and frequency respectively. For example, if a speech signal is recorded
on magnetic tape, transmitted at double speed, re-recorded at the receiving
end and then played back at half speed, the volume of information transmitted
remains unchanged, but transmission is performed twice as fast at twice the
frequency.
Companding of a speech signal, i.e. compression at the transmitting end
and expansion at the receiving end of a communications channel, may be
frequency, amplitude or time companding. One extreme form of amplitude
companding of a speech signal is clipping. In this case, the amplitude of
the speech signal is limited at two levels and only the points at which the
function changes its sign are transmitted.
As we know, the smaller the base of a number system, the greater the
number of digits required to represent the same numbers. An optimal system
would be a system with the base e. However, it is impossible in practice to
produce such a system, and the base used must be either 2 or 3. The most
widely used system is the binary system, although the trinary system would be
more efficient, since 3 is closer than 2 to the value of e. Speech clipping
(Figure 1) corresponds to a binary number system of transmission. As
J. Licklider has stated [111, 112], when a speech signal is limited to two
levels, a sufficient quantity of information still remains in the signal to
provide intelligibility at the required level. The technical conditions for
impulse telephony call for 128 levels. Consequently, clipping a speech
signal decreases its volume by a factor of 7.
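A minimal sketch of two-level clipping and the factor-of-7 volume reduction it implies (the code and the test tone are illustrative, not from the source):

```python
import math

def clip_signal(samples):
    """Two-level (infinite) clipping: only the sign of each sample survives,
    i.e. only the zero crossings of the waveform are preserved."""
    return [1 if s >= 0 else -1 for s in samples]

# a 440 Hz tone sampled at 8000 Hz (illustrative values)
wave = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(16)]
clipped = clip_signal(wave)
print(set(clipped) <= {1, -1})   # True: only two levels remain

# impulse telephony uses 128 levels, i.e. log2(128) = 7 bits per sample,
# so clipping to 1 bit reduces the signal volume by a factor of 7
print(int(math.log2(128)))       # 7
```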
The intelligibility of speech increases if the speech signal is differ-
entiated before clipping. With this technique, the frequency of the clipped
signal is increased and the location of the extreme points of the original
signal is transmitted. Partial companding is achieved by dividing the
frequency of this signal at the transmitting end and correspondingly
multiplying it at the receiving end [36, 156] . The spectrum is narrowed by
only a factor of 6, and great distortion of the speech results; therefore,
this companding is not particularly promising. However, reduction of the
frequency range of speech by limiting the upper and lower ends of the spectrum
has the opposite effect. The human voice occupies a frequency range from
50-60 Hz to 15-20 KHz. If the only demand placed on a telephone conversation
is intelligibility and recognition of the voice of the person speaking, the
upper frequency limit can be reduced to 2.5-3 KHz. If intelligibility alone
is required, the signal volume can be decreased still further. Under noise
conditions, limiting the frequencies transmitted over a telephone channel to a
maximum of 3500 Hz and a minimum of 300 Hz increases intelligibility of the
speech [31, 130, 136].
Figure 1. Clipping of a Speech Signal: a, Original speech signal; b, Clipped
signal.
Time companding of speech eliminates
certain time intervals, and the pauses which
arise are filled with other transmitted
material. At the receiving end, the pauses in
the speech are filled in by the listener mentally [25, 36, 39, 73]. However,
the low degree of companding achieved (up to 2) and the reduction in
intelligibility indicate that this method has no particular future prospects.
Better results for time multiplexing of communications channels are yielded
by using the pauses between words and phrases in natural
speech, as well as the pauses on one line while
the conversation partner speaks on the other
line. The switching of individual conversa-
tions on one line into the pauses in other
conversations can be performed by clipping the
speech signal. With this system, four to six
conversations can be transmitted through one channel.
Parametric companding methods, although they allow considerably greater
compression of a communications channel, disrupt the microstructure of the
speech signal: only the parameters produced by a speech signal analyzer are
transmitted through the channel, and at the receiving end these parameters are
used to control a speech synthesizer. Thus, the parametric methods of
companding involve automatic recognition and synthesis of speech signals.
The devices used to transmit the speech signal by the parametric method
have come to be called vocoders [59, 60]. In semi-vocoder devices, the para-
metric method is used to transmit only the upper portion of the speech
signal, and the lower portion is transmitted without conversion. In addition to
providing multiplexing of communications channels, vocoders can be used to
provide secrecy for telephone conversations [2, 102, 157].
2. Investigation of the Formation of Sound
A number of works have been dedicated to the investigation of the
functioning of the ear [14, 84, 92, 114] and the vocal cords [64, 89, 94,
165]. In one work, six male voices are investigated by applying Fourier
analysis to one oscillating period of pressure in the speech channel, while in
[105, 106], the dependence of the base tone on air pressure in the glottic
chink is determined, and in another, sound formation and the determination of formants by
anatomical measurements of the vocal tract using X-rays and electronic computer
processing of the data are investigated.
Speech is created by pulses from the vocal cords, the oscillating fre-
quency of which determines the pitch of the base tone. These sound pulses have
a discrete spectrum with a large number of harmonics over a broad frequency
range. The frequency of the base tone lies primarily between 80 and 350 Hz.
According to the data of some authors, the amplitudes of the harmonics are
almost identical over a broad frequency range, while other authors
indicate that the amplitude of the harmonics decreases regularly with
increasing frequency. The audible signal is produced from the sound of
the vocal cords as it passes through the resonating cavities of the mouth and
nose, as a result of which the amplitudes of certain harmonics of the vocal
cord sound are decreased or completely suppressed, while that of others is
reinforced, forming resonant peaks (so-called formants), the measurement of
which has also been the subject of a number of works [65, 66, 165].
Each voiced sound corresponds to its own combination of formants.
According to some data, in Russian speech the vowels oo, oh, ah and ee
are characterized by a single formant, while the sound eh has two and the
sound y¹ has three. According to the data of [64, 66, 165], good recognition of
vowels requires that the first three formants be determined; according to
others, only two need be determined. Unvoiced consonants have no clearly
expressed formant areas and are distinguished by the moments of the zero
(M_0), first (M_1) and second (M_2) orders, characterizing the spectrum in the
band selected:

    M_0 = Σ_n A_n,   M_1 = Σ_n f_n A_n,   M_2 = Σ_n f_n^2 A_n,

where A_n is the amplitude of the n-th band of the spectrum and f_n is its
mean frequency.
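The band moments M_0, M_1, M_2 can be computed directly from a measured band spectrum; a small sketch, assuming hypothetical band amplitudes and mean band frequencies:

```python
def spectral_moments(amps, freqs):
    """Moments of a band spectrum: M0 = sum A_n, M1 = sum f_n*A_n,
    M2 = sum f_n^2*A_n, where A_n is the amplitude of the n-th band
    and f_n its mean frequency."""
    M0 = sum(amps)
    M1 = sum(f * a for f, a in zip(freqs, amps))
    M2 = sum(f * f * a for f, a in zip(freqs, amps))
    return M0, M1, M2

# hypothetical band amplitudes and mean band frequencies in Hz
amps = [0.25, 1.0, 0.5]
freqs = [500.0, 1500.0, 2500.0]
M0, M1, M2 = spectral_moments(amps, freqs)
print(M0, M1, M2)   # 1.75 2875.0 5437500.0
```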
The sound of the vocal cords h(t) has the frequency spectrum H(jω).
The resonators of the oral and nasal cavity, depending on the speech sound
being formed, have transfer functions G(jω), i.e. G_a(jω), G_o(jω), G_y(jω),
etc., corresponding to the vowels a, o, y, etc. The signal, upon leaving the
mouth, is determined by the equation

    C(jω) = G(jω)H(jω).
1 A vowel sound with no exact equivalent in English -- Tr.
Since the sound of the vocal cords has a spectrum almost identical for / 380
all people, the audible signal is determined only by the transfer function of
the resonator (C(jco) = G(jw)H(jw). Each phoneme corresponds to its own
standard transfer function. On the basis of this phenomenon, a device has been
developed for recognizing a man from his voice, as well as for producing
information concerning the condition of an astronaut during flight.
Each sound has its own shades. The timbre of the voice of any man
differs depending on the properties of the resonators and the change in the
base tone during speech. The amplitudes and frequencies of the formants of
each phoneme may change within certain limits [87, 113, 131]. This fact makes
more difficult the design of a device for automatic recognition and synthesis,
since the same concentrations of energies at the same frequency may belong to
different phonemes.
There are various opinions concerning the significance of the formants.
Some investigators believe that their definition can have no decisive signif-
icance in developing a device for recognizing speech signals [79, 121];
however, most authors still hold the opposite opinion [24, 67]. For example,
if we record the vowel ah on magnetic tape and suppress certain formant areas
within it, the ah is converted, for example, to an oo.
As we know, most phonemes, including the vowels, can be reproduced by a
whisper, without using the vocal cords. The exciting action is the breathing
noise, which can be looked upon as a random, stationary process n(t), with
spectral density N(w). Then, the spectral density of the output quantity, i.e.
the phoneme produced, for example ah, will be
    N_a(ω) = |G_a(jω)|^2 N(ω).
As we can see from this formula, the base tone carries no information
concerning the phoneme, and therefore in developing an apparatus for automatic
speech recognition, there is no need to take the base tone into consideration.
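The conclusion that the base tone carries no phoneme information can be illustrated with the whisper model: the output spectral density depends only on the resonator transfer function. A sketch with a hypothetical one-formant resonator (all parameter values and names here are assumptions, not from the source):

```python
import math

def output_density(G, N, w):
    """Spectral density of a whispered phoneme: N_a(w) = |G_a(jw)|^2 * N(w)."""
    return abs(G(w)) ** 2 * N(w)

# hypothetical one-formant resonator centered at w0 (bandwidth b),
# excited by flat, stationary breathing noise n(t)
w0 = 2 * math.pi * 700.0
b = 2 * math.pi * 100.0
G = lambda w: 1.0 / complex(w0 ** 2 - w ** 2, b * w)
N = lambda w: 1.0   # flat noise density

# the output density peaks at the formant and involves no base tone at all
at_formant = output_density(G, N, w0)
away = output_density(G, N, 2 * w0)
print(at_formant > away)   # True
```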
The transfer function of a phoneme G(jw) contains both amplitude and phase
data. As we know, human hearing does not react to a change in the phase shifts
of a complex signal; therefore, in developing an apparatus for automatic
recognition, these shifts can be ignored; however, in synthesis, a phase shift
between harmonics worsens the quality of speech sound considerably [35, 41, 108].
Dynamic spectrographs (so-called videographs) have been developed for the
analysis of speech signals; these videographs make it possible to produce a
three-dimensional representation of speech: the abscissa represents time, the
ordinate represents frequency and the third coordinate, the darkness of a point
on the frequency-time plane, represents the amplitude of the signal at the
given frequency [128, 137, illeg.] (Figure 2).
Figure 2. Videograph of Syllables (example taken from [64])
Videographs are divided according to their principle of operation into /381
devices with sequential analysis and devices with parallel analysis. One of
the first videographs was developed by H. Sund. This device also makes
it possible to photograph the amplitude-frequency dependence on a cathode ray
tube. One defect of Sund's videograph is the impossibility of observing and
recording the time envelopes of the speech signal.
One modification of the videograph is the intervalograph. In this
device, a signal is produced whose amplitude is proportional to the intervals
between transitions of the speech signal through the zero level.
On request by the Institute of Language and Literature of the Academy of
Sciences ESSR, the former Scientific Research Electronic Engineering Institute
of the ESSR Council of the National Economy prepared a spectrograph with
parallel analysis. A set of 52 filters is used to cover the range from 40 Hz
to 14 KHz. The spectrogram of the speech signal being analyzed is produced
simultaneously on three cathode ray tubes, making it possible to photograph
and visually observe changes in the spectrum of the speech signal.
A new variant of a spectrograph manufactured by General Electric has
80 filters with a spectral width of 75 Hz each, a local oscillator, a scanning
device, an amplitude logarithmizing device, and a camera for photographing the
results of analysis. The spectrograph can also be used for analysis of the
songs of birds, machine noises, the operation of the heart, etc. [186, 187].
Using the videograph it has been established that the accuracy of recog-
nition is increased not only by determining the position of formants and their
intensity, but also by establishing the rate with which they change in
frequency and level.
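A videograph of this kind amounts to a short-time spectrum computed over sliding frames. A pure-Python sketch (the frame sizes and test tone are illustrative assumptions):

```python
import math, cmath

def spectrogram(samples, frame=64, step=32):
    """Crude videograph: DFT magnitude over sliding time frames.
    Each row is one frame (time); each column is one frequency bin."""
    rows = []
    for start in range(0, len(samples) - frame + 1, step):
        seg = samples[start:start + frame]
        rows.append([abs(sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame)
                             for n in range(frame)))
                     for k in range(frame // 2)])
    return rows

# a 500 Hz tone sampled at 8000 Hz falls exactly on bin 500*64/8000 = 4
fs = 8000
tone = [math.sin(2 * math.pi * 500 * n / fs) for n in range(256)]
spec = spectrogram(tone)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(peak_bin)   # 4
```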
Speech is a continuous function of time between pauses for breathing, and
consists of individual, discrete phonetic elements -- phonemes. The phonemes
are varieties of sounds which depend on their pronunciation. There are always
more phonemes than sounds. In the Russian and English languages there are
approximately 40, in Estonian there are approximately 30 and in German there
are approximately 40.
The problem of recognition of speech sounds can be solved either by
comparison of a phoneme, syllable or word with the corresponding set stored
in the memory of the machine, or by acoustical characteristics alone. In the
first case, the machine performs comparison of the characteristics of the input
speech signal with the characteristics of signals stored in its memory,
and the input signal is output as the sound with the greatest correlation
coefficient. In the second case, the output signal is formed by analysis of
the acoustical characteristics alone.
The perception of speech by man can be divided into three stages. In the
first stage, the acoustical stage, a number of physical phenomena are perceived
and a combination of parameters is determined; in the second stage, the
phonetic stage, these parameters are compared with standard parameters in memory
and the initial recognition of the speech sound is performed; finally,
in the third, linguistic stage, the content of the information produced is
clarified (for example, endings are added to words which the speaker may not
have pronounced at all, etc.).
Automatic recognition of speech sounds is based primarily on the perform-
ance of the first two stages. In the general case, the machine must contain
devices storing information concerning the language in order for complete
recognition of speech sounds to be possible.
The most widespread methods of recognition of speech sounds are analysis
of the spectral, time and spectral-time characteristics.
The spectral method was first described by L. Myasnikov [16, 17, 18], then
later by others as well [61, 129]. According to his suggestion, the audible
oscillations are analyzed by pairs of filters with the following pass-bands:
500-700 and 800-1000 Hz; 1250-1500 and 4000-5000 Hz; 650-750 and 5500-6500 Hz;
1250-1500 and 400-500 Hz. The output of each filter pair is detected and fed
in counterphase to an indicator whose arrow either remains at the central
position or is deflected to one side or the other. The phonemes are differ-
entiated only according to the frequency, regardless of the amplitude. The
replacement of the indicator with the arrow by a three-position, polarized
relay was used to produce a recognition accuracy of up to 75-80%.
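Myasnikov's indicator logic can be sketched as follows; the three-position output is modeled on the polarized relay described above, and the threshold and all energy values are hypothetical:

```python
def filter_pair_sign(energy_low, energy_high, dead_zone=1e-3):
    """Two detected filter outputs fed in counterphase deflect an 'arrow'
    left (-1), right (+1), or leave it centered (0), like the
    three-position polarized relay in the text."""
    diff = energy_high - energy_low
    if abs(diff) < dead_zone:
        return 0
    return 1 if diff > 0 else -1

# hypothetical detected energies from one filter pair for different phonemes
print(filter_pair_sign(0.8, 0.1))   # -1: energy sits in the lower band
print(filter_pair_sign(0.1, 0.8))   # 1: energy sits in the upper band
print(filter_pair_sign(0.5, 0.5))   # 0: arrow stays centered
```

Because only the sign of the energy difference is used, the decision depends on where the energy lies in frequency, not on its absolute amplitude, matching the text's remark that phonemes are differentiated by frequency regardless of amplitude.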
C. Smith used 32 filters in his device [160, 161], and employed the / 382
principle of accentuation of formant peaks by determining the amplitude
differences in the adjacent filters. The signals produced were fed to a
comparison system which reacted only to signals with maximum amplitude. A
certain combination of channel numbers, i.e. formant frequencies, corresponds
to a certain vowel phoneme. However, the results were unsatisfactory due to
the fact that the uncertainties resulting from the influence of neighboring
phonemes on each other and displacement of formants resulting from changes in
the base tone could not be eliminated.
The investigation of formants and the spectral method of analysis were
used as the basis for construction of band, formant, scanning, harmonic and
correlation vocoders.
In the band type vocoder, the entire range of the speech signal is
analyzed in band filters having either even frequency division throughout the
entire range or even frequency division only in the lower range (up to
1000 Hz) with logarithmic frequency division in the higher range (above
1000 Hz). The outputs from each filter are rectified and used as para-
meters for analysis of the speech signal.
In the first vocoder, the speech signal was analyzed by ten main and
two supplementary filters. The band width of the first filter was 250 Hz, of
the others -- 300 Hz; the entire range analyzed was 2950 Hz wide. The output
from each filter was rectified, passed through a supplementary low frequency
filter (tuned to 0-250 Hz) and transmitted through a communications channel to
the synthesizer. The synthesizer was a noise generator with a frequency limit
of 250-3500 Hz. The output of the generator was connected to the generator
filter, which had the same frequency ranges as the analyzer. Thus, the output
of the generator filter was controlled by corresponding output of the analyzer
filter signal. The base tone was separated from the signal before it passed
through the analyzer filters and passed through a second, supplementary filter
with a frequency pass-band of 0-50 Hz for smoothing.
Since in this vocoder all the higher harmonics of the speech signal are
not transmitted, the synthesized speech sounds rigid and hoarse, but the
intelligibility is satisfactory.
The communications channel carries ten main signals and two supplementary
signals. Each channel requires a band width of approximately 25 Hz, i.e. a
total of approximately 300 Hz.
If we compare uncompressed speech signals with identical volumes of
information but with different ratios of frequency and dynamic ranges, we have

    c_1 = F_1 log(P_c1/P_n1),    c_2 = F_2 log(P_c2/P_n2),

where P_c1, P_n1, P_c2 and P_n2 are the powers of signal and noise at the
transmitting and receiving ends of the communications channel; F_1 and F_2 are
the corresponding frequency bands; c_1 and c_2 are the throughput capacities
for transmission of quantities of information.

Assuming that c_1 = c_2, i.e. the throughput capacities of ordinary and
vocoder transmission are identical, with F_1 = 3000 Hz and F_2 = 300 Hz we have
P_c2/P_n2 = (P_c1/P_n1)^(F_1/F_2) = (P_c1/P_n1)^10, i.e. theoretically the
vocoders require a ten times greater dynamic range. Actually, this requirement
is overstated and, according to the data of a number of authors [35, 63], if
transmission of uncompressed speech requires a signal/noise ratio of about
30 db, a band type vocoder with ten channels requires a ratio of about 40 db.
Vocoder signals can be transmitted either continuously or in the form of
pulses. The continuous signal has a spectral width of not over 15-50 Hz. The
dynamic range is somewhat less than that of the original speech, not exceeding
25-30 db. With pulse transmission, vocoder signals are quantized by level and
time. In frequency, not over two samples per hertz are required, while in
amplitude, samples each 1-1.5 db will suffice. With a spectral width of
25 Hz and a dynamic range of about 16 db, the channel should provide a
capacity of 200 bits per second (25 × 2 = 50 samples; 2^4 = 16 and 4 × 50 = 200),
or a total throughput capacity of about 2000 bits per second for a ten-channel
vocoder.
We know that the transmission of pulses requires a frequency of no less /383
than 1-1.5 Hz per bit/sec, so that transmission at a rate of 2000 bit/sec
requires a channel with a frequency width of 3000 Hz.
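The per-channel bit-rate arithmetic quoted above can be written out directly (a sketch of the source's figures, not a general formula):

```python
import math

# Per-channel figures quoted in the text: a 25 Hz vocoder-signal spectrum,
# two samples per hertz, and 16 quantization levels (4 bits per sample).
samples_per_sec = 25 * 2                 # 50 samples per second
bits_per_sample = int(math.log2(16))     # 4 bits encode 16 levels
per_channel = samples_per_sec * bits_per_sample
print(per_channel)        # 200 bit/sec per channel
print(per_channel * 10)   # 2000 bit/sec for a ten-channel vocoder

# at roughly 1.5 Hz per bit/sec, the pulse stream needs about a 3000 Hz channel
print(per_channel * 10 * 1.5)   # 3000.0
```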
According to information theory, speech can be transmitted without distor-
tion when the following condition is fulfilled:

    A ≥ (1/3) D F bit/sec,

where A is the throughput capacity of the channel, D is the average
effective dynamic range and F is the width of the frequency range of speech. If
we consider that the speech signal range is 5000 Hz and the mean dynamic range
is 30 db, then A ≥ 5·10^4 bit/sec; consequently, according to the data above,
the loading of communications channels can be decreased by a factor of approx-
imately 25.
In a semi-vocoder, the lower frequency of speech (up to 600 Hz) is trans-
mitted without conversion, while the high frequency area (above 600 Hz) is
analyzed and transmitted by a vocoder [69, 155]. In comparison to vocoders,
the required channel volume in this case is increased, but the intelligibility
of one-syllable words is increased from 74 to 84%.
In a scanning vocoder [175, 176, 178], the speech signal is analyzed in
100 filters, the outputs of which are detected and stored in condensers. The
voltage level of the condensers is transmitted by a rotating commutator at
30 rotations per second to a synthesizer, where the sound is restored using a
controlled multivibrator. Vocoders have also been developed in which the base
tone is not transmitted; in this case, the intelligibility produced is up
transmitted in linear combinations. The difference is in the synthesis at the
receiving end: in band type vocoders, the parameters are fixed by filters with
the same bands as the analyzers, while harmonic vocoders have filters at the
receiving end corresponding to the expansion into the Fourier series.
Figure 3. Block Diagram of Formant Vocoder
The correlation method of analysis is based on the relationship between
the autocorrelation function R(τ) and the energy spectrum of the signal S(ω)
[23, 33, 154]:

    S(ω) = ∫ R(τ) e^(-jωτ) dτ,    R(τ) = (1/2π) ∫ S(ω) e^(jωτ) dω.
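In discrete form this relationship (the Wiener-Khinchin theorem for finite sequences) can be checked numerically; a pure-Python sketch, illustrative only:

```python
import math, cmath

def autocorrelation(x):
    """Circular autocorrelation R(tau) of a real sequence (discrete analogue
    of the correlation function used in the text)."""
    N = len(x)
    return [sum(x[n] * x[(n + tau) % N] for n in range(N)) for tau in range(N)]

def energy_spectrum(x):
    """Energy spectrum S(k) = |X(k)|^2 via a direct DFT."""
    N = len(x)
    X = [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
         for k in range(N)]
    return [abs(v) ** 2 for v in X]

# check: the DFT of R(tau) equals S(k)
x = [math.sin(2 * math.pi * n / 8) for n in range(8)]
R = autocorrelation(x)
S = energy_spectrum(x)
S_from_R = [abs(sum(R[t] * cmath.exp(-2j * math.pi * k * t / 8)
                    for t in range(8))) for k in range(8)]
max_err = max(abs(a - b) for a, b in zip(S, S_from_R))
print(max_err < 1e-6)   # True
```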
The correlation method allows us to avoid the influence of the effects of
phase shifts in the synthesized speech, which appear in band type, formant
In a pulse vocoder, there are ten evenly distributed filters, the
outputs of which are rectified and used to control pulse generators so that
pulse width modulation is produced, the number of pulses for each formant being
proportional in amplitude and frequency. According to the statement of its
author, this vocoder has great interference stability.
The spectral-time method of speech recognition differs from the spectral
method in that the output of the filters is scanned with respect to time as
well as frequency. As a result, phonemes are transmitted through the communi-
cations channel in the form of code symbols [140, 142].
A further development of the spectral method is the formant method. It
consists of determination of the presence of a given formant in a given band
of filters [67-69, 75] (Figure 3). In vocoders of this type, up to four
formants are distinguished. Improved formant vocoders, in which the intensity
as well as correlation between frequencies and amplitudes of formants are
analyzed, have been developed by several authors [3, 50, 64]. Formant
vocoders in which the moments of first and second order are considered have
given better results, particularly for determination and synthesis of conson-
ants than the vocoders described above [43, 63, 85]. It has been suggested
that the first formant be determined both in the 250-850 Hz range and in the
300-1200 Hz range. According to one author, it is sufficient to have an upper limit
to the filter defining the first formant of not over 1000 Hz. The second
formant is determined between 900 and 2300 Hz. Since the frequencies of
formants overlap in this vocoder, it has been suggested that the position of
the first formant be determined first. If it is between 700 and 900 Hz, the
frequency of the second formant must be no lower than 1400-1800 Hz; if,
however, the first formant is considerably lower than 700 Hz, the frequency of
the second formant lies between 700 and 900 Hz and it is necessary to retune
the filters accordingly. If a third formant is transmitted as well, it will be
defined in the 2100-3500 Hz band, the fourth -- in the 4000-6000 Hz range. / 384
The moments are determined by differentiation of the spectrum. If the base
tone is transmitted by two parameters (frequency and level) and the four
formants are transmitted by three parameters, 14 signals must be transmitted in
all. If the moments as well as the dynamic indicators of the formants such as
rate and direction of change of frequency and level are transmitted, the
number of signals transmitted is increased.
Scanning vocoders also have analyzing filters, but the signals are trans-
mitted sequentially in time to the expander. Whereas in formant
vocoders the speech signal is analyzed, in contrast to band vocoders, only at
frequencies corresponding to the position of the formants in the speech signal,
harmonic vocoders analyze the speech signal completely, determining the Fourier
coefficients and transmitting the terms of the series (except for the constant
component) through the communications channel using two parameters. At the
receiving end, these parameters control either a discrete spectrum generator,
or a noise generator, sometimes both [20, 21, 27]. In their principle of oper-
ation, harmonic vocoders differ little from band type vocoders. In both cases,
the ordinates of the spectrum are determined, except that in band vocoders they
are transmitted without conversion, while in harmonic vocoders they are
and harmonic vocoders due to the presence of a complex impedance in the band
filters of the synthesizer. This achieves compression by a factor of 10.
In order to improve intelligibility of the phonemes in the spectral-time
method, it has been suggested that all phonemes be preliminarily divided
according to their characteristic indicators [94, 95]. An electronic binary
system [43, 45, 185] separates voiced and unvoiced sounds using filters -- the
voiced sounds have a base tone, the unvoiced sounds do not. In the next step,
the noise voiced sounds are distinguished from the non-noise voiced sounds
by the presence or absence of the first formant at the output of the filters.
Unvoiced sounds are divided into plosive and fricative by the difference in
their amplitudes. The unit for separating voiced sounds into noise and non-
noise types operates with an accuracy of up to 95%. The unit for separating
vowels into higher and lower vowels operates with an accuracy up to 98%, while
the unit separating lower vowels into diffuse and compact operates with an
accuracy of up to 94%.
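The binary separation described above can be sketched as a small decision tree; the direction of the first-formant split and the amplitude threshold are assumptions, since the text does not specify them:

```python
def classify_sound(has_base_tone, has_first_formant, amplitude):
    """Sketch of the binary separation described in the text.
    The formant-split direction and the 0.5 amplitude threshold are
    hypothetical -- the source does not give them."""
    if has_base_tone:                   # voiced vs. unvoiced (base tone)
        if has_first_formant:           # assumed: vowel-like voiced sounds
            return "non-noise voiced"
        return "noise voiced"
    if amplitude > 0.5:                 # unvoiced: plosive vs. fricative
        return "plosive"
    return "fricative"

print(classify_sound(True, True, 0.9))    # non-noise voiced
print(classify_sound(True, False, 0.0))   # noise voiced
print(classify_sound(False, False, 0.9))  # plosive
print(classify_sound(False, False, 0.2))  # fricative
```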
Comparative data on the volume of the communications channels are shown
in Table 1.

TABLE 1. REQUIRED CHANNEL VOLUME*

    discrete form of speech signal
    word transmission (120 words per minute):
        a) vocabulary of 2 words
        b) vocabulary of 8,000 words
    teletype (120 words per minute)

*Considering that the speech signal range is 3,000 Hz. (The channel-volume
figures are illegible in the source.)
A number of works have been dedicated to improvement of the technical
indicators of vocoders [49, 114, 164]. For example, digital vocoders have been
developed where in place of exciting the synthesizer with the voice, as was
done in the first vocoders, the excitation is performed by the base tone
which in turn is used to turn on a special generator. The synthesizer uses a
reducing device, which eliminates the noise arising as a result of variation
of the base tone. A new system of compression has been developed which
does not require separation of the base tone. The usage of a multi-
channel modulator has reduced the transmitted signal spectrum to 1.4 KHz.
A vocoder system has been suggested allowing a considerable reduction in the
volume of the communications channel (to 94 bit/sec); the operation of the
vocoder is based on the usage of a memory device in which all phonemes are
stored, each phoneme having its own code and being "called up" by digital
signals.
Another suggestion for reducing the volume of the communications channel
is based on the usage of syllable synthesizers. A code for the syllables in
the form of digits is obtained in the analyzer, and the digits are then trans-
mitted to the synthesizer and the required syllable is selected on the basis of
these code digits. A synthesizer with a volume of 200 syllables has been
constructed on this basis.
The EVA electron-analog synthesizer reproduces speech sounds using curves
drawn with current-conducting ink on a tape or drum. The transmission of
speech by this method can be performed at a frequency 30 times lower than that
required for ordinary telephone conversation. The intelligibility of words
reaches 75%. According to a statement by the author, intelligibility
can be increased to 85%, and the frequency compression increased to 1,000.
This type of transmission requires very little power, which is extremely
important in establishing communications with astronauts.
4. Special Devices for Recognition of Speech Signals
The spectral-time method was studied in detail by H. Olson and H. Belar
[122, 123] and J. Dreyfus-Graf [57, 58], who developed typewriters controlled
by oral commands. The first machine of H. Olson and H. Belar typed words from / 386
a vocabulary of ten single-syllable words in the memory of the machine; the
third typewriter had a vocabulary of 150 single-syllable words (Figure 4).
In the opinion of the authors, a typewriter with a memory of 1,000 sound
combinations is sufficient for practical uses (obviously, this require-
ment is also sufficient for the Russian and Estonian languages, since in these
languages phonetic transcription and printed form differ little from each
other). The latest model of their voice-powered typewriter consists of eight
band-pass filters, eight amplitude-comparing detectors, syllable and ortho-
graphic memory units, control devices and the typewriter. The filters cover
the range from 250 to 20,000 Hz. The output of each channel is transmitted to
a device where the maxima are accentuated by comparing the levels of the
outputs of neighboring filters and the second derivative of the curve of the
speech signal is defined. Quantization of the input signal with respect to
time occurs each 0.2 sec, and after its output from the filters it is
quantized in the memory each 0.04 sec (in all, also 0.2 sec) and is quantized / 387
by amplitude in three levels.
The comparison circuit compares the output curve of the speech signal
before and after quantization, i.e. each 0.04 sec. If no change in signal
amplitude occurs during this time, the signal does not reach the memory. The
memory has 256 different characters, including pause. The results of analysis
are transmitted in an eight digit code, which at a rate of pronunciation of
ten phonemes per second requires a rate of 80 bits per second. The accuracy
of operation of the machine is 92-94% if it is adjusted to the speaker using
it, considerably lower if random speakers are used.
Figure 4. Block Diagram of the Phonetic Typewriter of H. Olson and H. Belar.
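In present-day terms, the quantization and comparison-circuit steps of the Olson-Belar machine can be sketched as follows (an illustrative Python fragment; the channel data, frame layout and function names are assumed for the example, the original device being analog):

```python
import numpy as np

def quantize_channels(env, n_levels=3):
    """Quantize a channel-envelope array to n_levels amplitude levels."""
    lo, hi = float(env.min()), float(env.max())
    step = (hi - lo) / n_levels or 1.0      # guard against a flat signal
    return np.minimum(((env - lo) / step).astype(int), n_levels - 1)

def drop_unchanged_frames(frames):
    """Model of the comparison circuit: a frame reaches the memory only
    when it differs from the previously stored frame."""
    kept = [frames[0]]
    for f in frames[1:]:
        if not np.array_equal(f, kept[-1]):
            kept.append(f)
    return np.array(kept)
```

Dropping repeated frames is what keeps the machine's output rate near 80 bits per second despite the much faster sampling of the filter outputs.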
H. Olson and his colleagues developed a device in which recognition,
recoding, printing and synthesis of speech is performed. The memory of
the machine consists of four English, eight French, four German and four
Spanish words. The rate of operation of the machine is 60 words per minute.
The accuracy of the operation with known speaker is 96-98%.
In the fourth model of the phonetic typewriter, still under development
by J. Dreyfus-Graf, according to an announcement by the author, the
specific features of the speech of various speakers have been eliminated. As
in preceding models by this author, there is no memory, so that its accuracy
of operation depends to a considerable extent on the carefulness of pronunci-
ation of the speaker. Analysis of the speech spectrum is performed in ten
filters (400-3200 Hz); also, there are other filters covering the 150-300 and
4500-6000 Hz range. Each spectral channel contains a detector and a low
frequency filter for the determination of subformants (0-30 Hz) and a
quantizer. The rate of change of the speech signal envelope is also
determined. As a result, the speech signal is separated into 48 signals, each
of which is quantized by time each one-fifteenth second and by three levels.
The total information produced amounts to 1440 bits/sec, and the alphabet of
the machine has 30 letters. No data on the accuracy of operation of the
machine have yet been published.
The writing machine of M. Kalfaian has a device for reducing the
speech signal to a single base tone frequency using a generator controlled by
the base tone of the input signal. There are filters whose outputs are
compared after rectification with data contained in the memory by using
electronic relays; the machine prints phonemes in a phonetic alphabet, not
the ordinary written alphabet. No correction for errors in pronunciation is
performed. The details of the operation of the machine have not yet been published.
The mechanical speech signal recognition machine developed at London
University has, in contrast to the preceding machines, a linguistic memory.
The inventor of the machine states that the results of analysis can be
used to control a typewriter, but he does not expect particularly great
success. The recognition machine can distinguish four vowels and nine conson-
ants. The differentiation is performed according to the maximum voltage from
two of eighteen filters covering the speech signal range from 160 to 8000 Hz.
It has been established, for example, that the vowel i corresponds to the
maximum product of the outputs of the filters at 250 and 3200 Hz, the
consonant m -- to the maximum product at 200 and 320 Hz. Consonants with
almost identical spectra are distinguished further either according to length
or voltage level. The linguistic memory contains information on the
probability of sequences of two phonemes. The device consists of several
potentiometers, and the information is input to the matrix of potentiometers
by the position of their slide contacts. The acoustical device determines the
form of the spectrum, length and intensity of sound and presence of base tone.
The linguistic device contains standard combinations of phonemes present in
speech, and combinations which do not exist in speech are forbidden. Thus,
recognition occurs in the acoustical device and correction, i.e. improvement
of intelligibility, occurs in the linguistic device. Experiments with
200 words have produced an accuracy of recognition of 72% (45% for any speaker).
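The linguistic correction step, which forbids phoneme pairs that do not occur in speech, can be illustrated by a minimal sketch (hypothetical Python; the bigram table, scores and function name are invented for the example):

```python
def correct_phoneme(prev, candidates, bigram_prob):
    """Linguistic correction: re-rank acoustic candidates by the stored
    probability of the two-phoneme sequence; sequences absent from the
    memory are treated as forbidden."""
    scored = [(p, s * bigram_prob.get((prev, p), 0.0)) for p, s in candidates]
    phoneme, score = max(scored, key=lambda c: c[1])
    return phoneme if score > 0 else None
```

Here the acoustic device supplies the candidate scores and the potentiometer matrix plays the role of the bigram table.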
The correlation method of recognition is also based on multiplication of
filter outputs. A device developed in the Bell Laboratories is designed for
recognition of ten numbers pronounced by any subscriber. The speech
spectrum received is multiplied by a typical phoneme spectrum, after which the
average value is determined. The maximum value for multiplication corresponds
to the phoneme which is most similar to that received. The device consists
of a complex containing ten relays, only one relay giving an output pulse for / 388
any one digit. The accuracy of recognition is 97-99% with a known speaker,
50-60% with any speaker.
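The multiply-and-average rule of the correlation method can be written out as a short sketch (illustrative Python/numpy; the template spectra and names are assumed):

```python
import numpy as np

def recognize(spectrum, templates):
    """Multiply the received spectrum by each stored phoneme spectrum and
    take the average; the template giving the largest mean product is the
    recognized phoneme."""
    scores = {name: float(np.mean(spectrum * t)) for name, t in templates.items()}
    return max(scores, key=scores.get)
```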
In Japan, a device has been developed to recognize ten numbers pronounced
separately. This process is performed using eight characteristics (number of
voiced intervals per word, position of first and second formants at the
beginning of the first voiced interval, position 100 msec after beginning of
first voiced interval, etc.), with from two to thirteen levels each. The
voltages corresponding to the levels of these eight characteristics are sent
to a matrix circuit in which the conditional probabilities of all numbers are
calculated. The highest conditional probability corresponds to the number
actually pronounced. In pronunciation of one thousand words by one speaker,
an accuracy of 99.7% was achieved, while when twenty different speakers (all
men) pronounced one thousand words, an accuracy of 97.9% was achieved.
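The matrix circuit amounts to choosing the digit with the largest product of per-characteristic conditional probabilities; a minimal sketch (hypothetical Python; the probability tables are invented for the example):

```python
def classify_digit(levels, cond_prob):
    """levels[k] is the observed level of characteristic k.  cond_prob maps
    digit -> list of per-characteristic tables {level: P(level | digit)}.
    The digit with the largest product of conditional probabilities wins."""
    best, best_p = None, -1.0
    for digit, tables in cond_prob.items():
        p = 1.0
        for k, v in enumerate(levels):
            p *= tables[k].get(v, 1e-9)     # unseen levels get a tiny floor
        if p > best_p:
            best, best_p = digit, p
    return best
```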
The results of an experimental investigation on the determination of
various characteristics of numbers pronounced in Russian have also been presented.
The time method of recognition is based on the usage of clipped speech.
This method was presented in the works of I. Licklider and subsequently
in the works of other scientists [145, 150, 174], including Soviet scientists.
G. Tsemel' uses clipped speech for differentiation of certain
consonants. According to his data, the accuracy of recognition of the sounds
p and t was 95%, the sound k -- 75%.
A. Rais analyzed consonants based on their good differentiability
by the ear. The experiments were performed to determine the capabilities for
input of data in oral form to an automatic translation machine. It was
established that the dependence of the number of pulses of clipped speech on
the time for various vowels pronounced by different speakers is almost linear.
The results of the experiments with three speakers are presented in Table 2.
Preliminary differentiation of the signal was not performed.
I. Toffler suggested a method of separating the base tone from speech
signals by using nonlinear elements and the clipping method.
Clipped speech was also used at Kyoto University. Analysis was
performed both for vowels and for consonants. According to the methodology
used in this work, the speech signal was amplified to a preselected level,
then fed to the input of a stage of Kipp oscillators, the outputs of which are
pulses corresponding to the moments of changes in the direction of the
rectangular clipped speech signal. The Kipp oscillator stage, the pulse
lengths from which have different values in different cases, classifies the
clipped speech signal into fourteen values from 0.59-1.11 to
22.55-32.22*10^-3 sec. In each channel, the number of pulses is determined by a
counter. Analysis of vowels is performed according to the distribution of
time intervals W(t) and the univariate probability distribution W1(t)
according to the formulas

    W(t_mi) = v_i / N   and   W1(t_mi) = v_i / (N * Δt_i)     (i = 1, ..., M),

where t_mi is the mean time interval in the i-th channel; v_i is the number of
intervals in the i-th channel; Δt_i is the time difference between the upper
and lower boundaries of the i-th channel; N is the total number of intervals.
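The interval statistics of clipped speech can be computed directly from a signal; a sketch of the procedure in Python/numpy (the channel boundaries are assumed, and clipping is idealized as taking the sign of the signal):

```python
import numpy as np

def interval_distributions(x, fs, edges):
    """Clip the signal to its sign, measure the durations between sign
    changes, histogram them into M channels bounded by 'edges' (seconds),
    and return W (share of intervals per channel) and W1 (share per unit
    of channel width)."""
    s = np.sign(x)
    s[s == 0] = 1
    changes = np.flatnonzero(np.diff(s)) + 1
    intervals = np.diff(changes) / fs       # seconds between direction changes
    counts, _ = np.histogram(intervals, bins=edges)
    n_total = counts.sum()
    W = counts / n_total
    W1 = counts / (n_total * np.diff(edges))
    return W, W1
```

For a pure tone all intervals fall into a single channel, one half-period wide.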
A writing machine based on analysis of clipped speech was constructed at / 389
the same university, and has a memory volume of 200 syllables. The
details of operation of the machine have not yet been published.
A device for automatic distinction between twenty spoken words (the
numbers from zero to nine, plus, minus, space, forward, back, etc.), based on
clipping of preliminarily differentiated speech signals, was developed at the
Academy of Sciences of the Georgian SSR. It has been stated that the
device is capable of distinguishing up to twenty spoken commands when tuned
for a particular voice, or on the order of five commands from many voices.
The portable electronic calculating machine SHOEBOX, as yet the only
series produced model, reacts to spoken pronunciation of the ten numbers (from
zero to nine) and six additional commands: plus, minus, sum, partial sum,
error, clear. The word "error" stops the device and clears all operations
performed. Recognition occurs according to the location of the first phoneme,
stressed vowel, last phoneme and time envelope of positive and negative peaks
of the curve of the word. The operation of the device is unstable with
different speakers.
Of the three writing machines which we have described [58, 123, 149],
the most promising device, according to the materials of the Stockholm Seminar
of 1962, is the machine of J. Dreyfus-Graf, which has no limiting memory
unit [32, 141].
Let us present certain other data concerning special devices for the
analysis, recognition and usage of speech signals.
The analysis and recognition of audiofrequency signals can be performed
using a device which has been developed, the operation of which is based on a
resonant mechanical system consisting of glass fibers of various lengths and
diameters up to 0.05 mm. One end of each fiber is fastened to a special
shaped base, the other end can oscillate freely under the excitation of the
audio waves. Using light beams, the oscillations of these free ends are
projected through a standard screen onto photo diodes. The element consists
of 2,000 fibers with a total volume of 16 cm³, analyzing the frequency range
from 30 to 20,000 Hz with a resolving capacity of about 10 Hz. Each standard
screen stores the signals of prototypes. The photo element integrates the
entire quantity of light, and the more closely the signal being analyzed
corresponds to the standard signal represented by the standard screen, the
greater the total light flux. If the flux exceeds a given threshold, the
signal is recognized. Since the standard can also be changed in correspond-
ence to the information produced, this element has characteristics of self-
teaching. The device recognizes only short words pronounced by a given voice.
This device was constructed in an attempt to establish communications with
dolphins. It was established that the "speech" of dolphins consists of short
signals (approximately 0.1 sec) at frequencies from 5 to 10 kHz [180, 181].
Harmonic analysis of periodic functions fixed in the form of graphs or
tables can be performed using an electromechanical harmonic analyzer allowing
simultaneous production of five pairs of coefficients of Fourier series with an
accuracy to 0.3% of the maximum value of the function being analyzed.
A device has been constructed in which the recognition of commands
(numbers) is achieved using visual data on the movement of the lips. Light
sources with directed reflectors are installed on each lip. A photo resistor
with a collecting reflector is placed before the lips, connected to one arm
of a balanced bridge. The voltage taken from the bridge is amplified, fed to
a differential amplifier and thence to a strip chart recorder. The light
sources and photo resistors with reflector are connected to the head of the
speaker. The intelligibility of ten numbers for a concrete speaker was 91%,
for two different speakers -- 78.3%. By determining one more parameter -- the
rate of movement of the air flow near the lips -- the intelligibility for a
single speaker could be increased to 100%, for two speakers -- to 81%.
A system of solenoids has been theoretically developed, using which
different codes can be produced for 24,000 English words up to 16 letters
long; however, this system for word recognition has not yet been constructed.
Essentially, the system is a memory device in which the curves of speech
signals are stored in digital form and which can be used to produce momentary
values of the correlation function of the speech signal [38, 134].
Based on the usage of relay systems, a device has been created for / 390
conversion of digital information to speech information. The sounds of the
numbers zero to nine and the words "volt," "second," "power" are recorded on
the tracks of a magnetic drum. The address of each drum track is output by a
ring counter, and the recording thus selected is sent to an audio amplifier.
The neuron network and auditory apparatus of man have been modeled by
several authors [19, 74, 83, 90]; also, a correlation theory of hearing has
been developed. On the basis of this model, a device has been developed
for the recognition of sounds and individual words. The device consists of
an output amplifier, a system of filters and a unit of logic and computer
circuits modeling the outer and middle ear, the inner ear and the nerve net-
works respectively. The voiced consonants b, d, g and unvoiced consonants p,
t and k are differentiated using differentiating logic circuits; however, no
complete electronic analog has yet been produced.
Experiments have been performed in the "understanding" of speech which is
not heard. Information on a phoneme is transmitted through the hand using
24 vibrators connected to the outputs of a functional model of the auditory
organ. The numbers from one to nine were recorded on magnetic tape. The
experimental subject is told if he makes an incorrect answer, and the number
is repeated. After two to three hours of training, 85% correct answers were
produced. This experiment is of great significance for the deaf.
The human voice is not symmetrical about a transitional axis as is, for
example, the noise spectrum. This peculiarity, "asymmetry of the envelope,"
was used for the creation of a safety switch which turns off a powerful
machine, such as a machine tool, if the operator shouts.
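The envelope-asymmetry criterion behind such a safety switch can be sketched as follows (illustrative Python/numpy; the threshold value and function name are assumed):

```python
import numpy as np

def is_voice(x, threshold=0.2):
    """Speech shows a marked asymmetry between positive and negative
    envelope peaks; a symmetric signal (e.g. stationary machine noise
    or a pure tone) does not."""
    pos = float(np.max(x))
    neg = float(-np.min(x))
    asym = abs(pos - neg) / max(pos, neg, 1e-9)
    return asym > threshold
```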
A speech signal analyzer has been developed, consisting of 54 gaussian
type filters. Up to 1000 Hz, the filters have a width of 70 Hz each,
while over 1000 Hz the width increases by 6.5% each step. The output of each
filter is rectified and quantized; the quantized currents can be tracked
visually and, after the proper processing, can be input to an electronic computer.
A speech signal analyzer has been developed which consists of 96 filters
and covers the frequency range from 30 to 8,000 Hz. The signals are normalized,
detected, up to the fourth derivative is determined, and the output of the
analyzer is connected to a triple beam oscilloscope. The process of recog-
nition is currently being automated.
The solution of the problem of separating the base tone is of great
significance both for linguists and for the manufacturer of vocoders and
for solution of the problem of recognition of speech signals in general.
Several variants of a special apparatus [42, 56, 147] and devices have been
developed which have units for determining the autocorrelation function of
signals in order to separate peak maxima, the widths of formants [62,
170] and other supplementary devices [143, 184]. For example, it has been
suggested that the phase of the outputs of N filters be changed by 90 degrees.
These outputs are looked upon as multidimensional vectors which change with
time. If there is periodicity in the signal, the vector will form a closed
curve. Five filters of 120 Hz width each are used, covering the frequency
range from 300 to 900 Hz.
In order to separate the base tone in a speech signal, a device has been
developed consisting of elements with nonlinear characteristics and slight
time constants. Another system for analysis of the frequencies of
formants and base tone operates on the principle of the tracking filter with
preliminary transfer of the spectrum of frequencies being analyzed, or by
generation of a signal with manual adjustment to correspond to the signal
being analyzed.
5. Universal Computers as Means of Investigating and Recognizing Speech Signals / 391
The usage of universal computers for the investigation of speech signals
began shortly after computers were invented [48, 70, 83] and has expanded each
year. They are used in connection with spectral analysis of speech signals
[5, 37, 46, 99], for recognition and synthesis of a selected phoneme combin-
ation [38, 88, 91, 116], or numbers [52, 71, 152], to separate the base tone
[46, 72, 81, 120] and determine the parameters of formants [127, 167, 182],
to investigate the operation of vocoders, to determine the possibilities
for input of speech signals into computers for mathematical modeling of
communications systems [15, 46], etc. A method has been presented for the
expansion of a speech signal into a Fourier series and determination of its
constants. The amplitudes of each sequence of frequencies are logarithmized
and analyzed in a second spectral analyzer. The output of this analyzer is
the logarithm of the power spectrum and has peak values in the case of
analysis of voiced phonemes, and has no peak values if the phonemes are
unvoiced or have no base tone. Since the harmonics of the base tone in the
speech signal cause periodic pulsations in the amplitude spectrum, the
Fourier transform of the spectrum gives the frequency of the pulsations,
inversely proportional to the frequency of the base tone of the speech.
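This double spectral analysis -- in present-day terminology, a cepstrum -- can be sketched in a few lines (illustrative Python/numpy; the window and the 50-500 Hz pitch search range are assumed):

```python
import numpy as np

def base_tone_period(x, fs):
    """Log-magnitude spectrum followed by a second Fourier analysis of that
    spectrum; for a voiced signal the result peaks at the base-tone period."""
    spec = np.abs(np.fft.rfft(x * np.hanning(len(x))))
    ceps = np.abs(np.fft.irfft(np.log(spec + 1e-9)))
    lo, hi = int(fs / 500), int(fs / 50)    # search a 50-500 Hz pitch range
    return (lo + int(np.argmax(ceps[lo:hi]))) / fs
```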
Spectral analysis has been performed for a group of preliminarily
segmented vowel sounds, using a special device which performed digital coding
of the speech segments being analyzed. These data were input to the computer
and a stepped synchronous analysis was performed using the Fourier method.
A new technique has been suggested for measurement of the frequencies
and widths of the formants of a speech signal, based on the theory of Fant.
A form is assigned to the spectral equations:

    f(t) = a_0 + Σ_{i=1}^{N} e^(-πB_i t) (a_i cos 2πF_i t + b_i sin 2πF_i t),

where N = 3 or 4 (number of formants).
A method is presented for determination of the numerical values of the
coefficients a_0, a_i, b_i, F_i (1 ≤ i ≤ N) by computer using the least squares
method (B_i is the width of the i-th formant and F_i is its frequency). Based
on the experimental materials from analysis of three words (bought, bottle and
beet) pronounced by two persons twice each, it is affirmed that the first two
formants for o, i and a are found quite reliably.
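The least-squares determination of the coefficients can be illustrated for a single formant (N = 1); a Python/numpy sketch in which the frequency and width grids are assumed, whereas the paper fits three or four formants simultaneously:

```python
import numpy as np

def fit_formant(x, t, F_grid, B_grid):
    """Grid search over formant frequency F and width B; for each pair the
    amplitudes a0, a, b of
        f(t) = a0 + exp(-pi*B*t) * (a*cos(2*pi*F*t) + b*sin(2*pi*F*t))
    are found by linear least squares, and the pair with the smallest
    residual is kept."""
    best = None
    for F in F_grid:
        for B in B_grid:
            env = np.exp(-np.pi * B * t)
            A = np.column_stack([np.ones_like(t),
                                 env * np.cos(2 * np.pi * F * t),
                                 env * np.sin(2 * np.pi * F * t)])
            coef, *_ = np.linalg.lstsq(A, x, rcond=None)
            r = float(np.sum((A @ coef - x) ** 2))
            if best is None or r < best[0]:
                best = (r, F, B, coef)
    _, F, B, coef = best
    return F, B, coef
```

Fixing F and B makes the problem linear in a0, a, b, so only the two formant parameters need a search.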
The energy spectrum of speech signals can be calculated by methods other
than expansion into Fourier series, for example by separation of this spectrum
into polynomials or representation of the spectrum as a Markov process.
Results have also been presented from a correlation-spectral analysis,
together with estimates of the error in the spectral analysis method and of
the frequency of quantization of the speech spectrum with correlation and
spectral analysis by computer.
It has been stated that the problem of recognition of various
patterns, including speech patterns, can be reduced to the problem of finding
the class to which a given signal belongs if the total number of classes of
signals in which the signal may be included is known. This problem is solved
by minimizing a certain risk function, as a result of which optimal decision
rules are found; these rules were used for recognizing the output
signals of an electronic analog of the auditory helix.
A method has been suggested for representing a model of a formant as an
n-dimensional vector, each component of which is a single discrete value of
the formant at a given moment in time . Thus, the dimensions of an
n-dimensional space are determined by multiplying the discrete values of
parameters of the formant by the moments of observation. The computer
develops this data in two steps, called by the inventor of the process
learning and recognition. The results of analysis of each of ten phonemes / 392
(first letters of the alphabet) pronounced by ten speakers are presented in
the form of a matrix.
It has been suggested that a computer program be developed to calculate
the energy and envelope of the speech signal over a fixed time interval, the
frequency of transition of the signal through the zero level and the distribu-
tion of intervals between zeros in clipped speech throughout the entire time
interval, the autocorrelation and mutual correlation functions and also to
perform spectral analysis of the speech signal in order to recognize the
signals. An analog-digital converter with eight digit readout of binary
numbers has been developed for this purpose for input of sound information
into the computer.
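The quantities listed -- energy, envelope and zero-level transitions over fixed time intervals -- can be computed as in the following sketch (illustrative Python/numpy; the frame length is an assumed parameter):

```python
import numpy as np

def frame_features(x, fs, frame_ms=20):
    """Per-frame energy, peak envelope and zero-crossing count -- the basic
    quantities such a program computes over fixed time intervals."""
    n = int(fs * frame_ms / 1000)
    feats = []
    for i in range(0, len(x) - n + 1, n):
        f = x[i:i + n]
        s = np.sign(f)
        s[s == 0] = 1
        feats.append((float(np.sum(f ** 2)),               # energy
                      float(np.max(np.abs(f))),            # envelope peak
                      int(np.count_nonzero(np.diff(s)))))  # zero crossings
    return feats
```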
A speech signal has also been analyzed in a 30-channel band type analyzer.
The computer is used to determine the frequencies F1, F2 and F3 each 10 msec.
The sectors of speech during which the formant frequencies do not change are
not taken into consideration. An experiment for the recognition of ten
monosyllabic words spoken three times each by three speakers gave 100% correct
results. Recognition of speakers by their voices was also successful.
The problem of separating the base tone has also been the subject of many
works. In addition to special apparatus outlined above, this work is being
performed by computer, in most cases together with determination of other
speech parameters. In [81, 182], a program is developed which is used to take
a speech signal preliminarily processed in a correlation type analyzer through
an analog-digital converter, introduce it into a computer for determination of
the parameters of the base tone and other indicators of the speech signal such
as: frequency and amplitude of the first three formants, instantaneous signal
power and rate of change of all quantities. The results of separation of the
base tone are compared to data produced earlier, in which the determination
of the base tone and rate of its change during the pronunciation of individual
words and syllables was performed generally automatically, while more precise
determination was made by manual measurement of distances between peaks on
oscillographs of the speech signals. Good correspondence of the measurement
results is noted.
In [167-169], a process is analyzed for separating the frequencies of
formants and determining the formants in one-syllable words of the Japanese
language by computer. First, the speech is preliminarily analyzed in a
filter system covering the frequency range from 200 to 5900 Hz, then the
speech is coded by an eight bit binary word and sent to a computer magnetic
tape recorder. In the computer, the formant frequencies are separated from
the speech, the rate of change of formant frequencies is determined and the
second order moment is defined near the mean frequency; the phonemes are
segmented and phoneme classification of the spectrum is performed. It is
stated that this method allows error free determination of separately
pronounced phonemes, as well as almost error free determination and recogni-
tion of sounds and unvoiced consonants.
In order to automate the exchange of information between a man and a
warehouse, a system has been created in which audible speech is analyzed
according to certain energy characteristics, converted into binary electrical
signals using a device containing frequency filters, analog-digital converters
and decoders, coded onto punched cards and transmitted to an electronic
computer. A subscriber can access the computer by telephone and
receive an answer in various languages (French, English, etc.). A
similar system (using the IBM-7770 computer) with a memory volume of 60 words
can answer 750 simultaneous telephone inquiries concerning prices on a stock exchange.
A unique trend in the investigation of speech signals by computer,
resulting from the search for methods to reduce the quantity of information
concerning the sound of speech, has appeared in the usage of the "analysis-
synthesis" method, suggested by K. Stevens et al. [86, 98, 127]. In one variant,
the initial determination of speech signal parameters is performed by compar-
ison of the input spectrum with a spectrum formed in the system as a result
of combination of six curves for the first formants and six for the second
formants, i.e. 36 standard spectral curves in all. Eight parameters are
changed in forming the standard spectra: the frequency and width of the bands
of the first three formants, the frequency of the fourth formant and the
position of the zero in the source spectrum. It has been stated that this
method allows sufficiently precise investigation of speech and the production
of initial data for the development of speech recognition apparatus.
In another suggestion, the speech signal is subjected directly to / 393
mathematical analysis in the computer without supplementary apparatus and is
approximated by 30 orthogonal functions in the form of an exponentially damped
Fourier series. When the sound is changed, the numerical values of the
coefficients are changed, while the functions themselves remain unchanged.
The symbols produced are used for synthesis.
Similar works are also being conducted in Japan [91, 98] and the Polish
People's Republic.
A decrease in the influence of the subjective properties of the speakers
on the result of machine recognition can be achieved to a certain extent by
using the principle of self-tuning and self-teaching. If many people are
talking, as in an ordinary discussion, this principle is, of course,
ineffective. Results have been described of work on the recognition
of patterns using these methods in combination with "analysis-synthesis"
methods. Using 20 characteristics, for example, a great quantity of combin-
ations can be produced, but many of these characteristics are nonessential,
so that submatrices are formed including only the essential characteristics.
The degree of importance of each characteristic and combination of character-
istics is determined during the course of self-teaching, i.e. by synthesis
of the required "dictionary." At first, the machine generates all char-
acteristics and, analyzing the relationships between individual cells,
determines the weight of each one, generating new relationships and determining
their weights in turn. Recognition is performed by the comparison method.
The number of proper answers achieved in recognizing known stylized portraits
and spectrograms of speech was 80-100% (lower for unknown portraits and spectrograms).
An algorithm for machine recognition using elements of self-tuning has
also been developed. With a memory volume of eighteen words, the
accuracy of differentiation achieved by this self-tuning system in processing
information input to the machine in four languages was 96%; when twenty words
were used, the accuracy was 86%. The system could successfully distinguish a
voice concerning which no preceding experience had been accumulated.
A program has been developed with which, given a limited memory volume
(numbers from zero to nine, plus, minus, equals, parentheses, etc., 83 words
in all), the computer "learned" to recognize these words both for a known and
for a random speaker (although with poorer results), to perform arithmetic operations on
command with the number introduced by voice, to print out the results and to
translate them into another language.
A computer machine for the recognition of speech signals not only has the
advantage that it makes it possible to use partial information and algorithms
developed for the machine for translation from one language to another, but
also has the advantage that its large memory volume and high operating speed
allow speech signals to be separated for analysis into extremely short
amplitude and frequency sectors, and allows data to be compared with standard
data quite rapidly. On the other hand, in order to achieve universality of
recognition, i.e. independence of the results of analysis from subjective
properties of the speakers, it should not contain comparison elements; also,
large computers are expensive and cannot be mobile. Therefore, in spite of
the computer's certain advantages over special machines, the latter, i.e. direct
analysis machines, are more promising for the final solution of the problem of machine
recognition of speech signals and usage of these signals in various control
and communications systems.
In this article we have touched upon the most general part of the work
in the area of machine recognition and have not discussed the problem of
speech synthesis at all. The main factor reducing the accuracy of the oper-
ation of speech recognition apparatus is the inconstancy of the formant
indicators, depending to a great extent on the individualities of the speakers,
and the difficulty of segmenting. However, the results of investigations
have already been used in communications technology, as well as military
affairs. For example, vocoders have been developed in the USA which are used
in aviation [2, 133]. As the apparatus is improved, the area of their
application will doubtless increase, as a result of which cybernetic systems
will receive new elements which will react to spoken commands without intermediaries.

REFERENCES

1. "Analysis of Man's Behavior by His Speech," Elektronika, vol. 37, No. 9, 1964. / 394
2. "The Army Orders a Miniature Vocoder," Elektronika, vol. 37, No. 15, 1964.
3. Varshavskiy, L. A., I. M. Litvak, "Investigation of Formant Composition and
Certain Other Physical Characteristics of the Sounds of Russian Speech,"
Problemy Fiziol. Akustiki, No. 3, 1955.
4. Vasilenko, A. T., Yu. N. Denisov, "An Electromechanical Harmonics Analyzer,"
Pribory i Tekhnika Eksperimenta, No. 6, 1963.
5. Voloshin, G. Ya., "Spectral Analysis of Speech Signals by Electronic
Computer," Sbornik Trudov In-ta Matematiki SO AN SSSR, Vychislitel'nyye
Sistemy [Collected Works of Institute of Mathematics, Siberian Department
Academy of Sciences USSR, Computer Systems], No. 10, Novosibirsk, 1964.
6. Voloshin, G. Ya., "Analog-Digital Converter for Input of Speech Signals to
Electronic Computers," Sbornik Trudov In-ta Matematiki SO AN SSSR,
Vychislitel'nyye Sistemy [Collected Works of Institute of Mathematics,
Siberian Department Academy of Sciences USSR, Computer Systems], No. 10,
Novosibirsk, 1964.
7. Voloshin, G. Ya., "The Frequency of Sampling of a Random Function with
Correlation-Spectral Analysis," Sbornik Trudov In-ta Matematiki SO AN
SSSR, Vychislitel'nyye Sistemy [Collected Works of Institute of Mathe-
matics, Siberian Department Academy of Sciences USSR, Computer Systems],
No. 14, Novosibirsk, 1964.
8. "The Computer Answers the Telephone," Elektronika, vol. 37, No. 5, 1964.
9. Zagoruyko, N. G., "The Exchange of Oral Information Between Man and
Computer Systems," Sbornik Trudov In-ta Matematiki SO AN SSSR, Vychisli-
tel'nyye Sistemy [Collected Works of Institute of Mathematics, Siberian
Department Academy of Sciences USSR, Computer Systems], No. 10,
Novosibirsk, 1964.
10. Zagoruyko, N. G., "Error in Calculating Energy and Envelope of Speech
Signal by Computer," Sbornik Trudov In-ta Matematiki SO AN SSSR, Vychis-
litel'nyye Sistemy [Collected Works of Institute of Mathematics, Siberian
Department Academy of Sciences USSR, Computer Systems], No. 10,
Novosibirsk, 1964.
11. Zagoruyko, N. G., G. Ya. Voloshin, V. N. Yelkina, "Automatic Recognition
of Speech Patterns," Sbornik Trudov In-ta Matematiki SO AN SSSR,
Vychislitel'nyye Sistemy [Collected Works of Institute of Mathematics,
Siberian Department Academy of Sciences USSR, Computer Systems], No. 14,
Novosibirsk, 1964.
12. Kakauridze, A. G., "Some Problems of Coding of Speech Vowel Sounds,"
Trudy In-ta Elektroniki, Avtomatiki i Telemekhaniki AN GruzSSR [Works of
the Institute of Electronics, Automation and Telemechanics, Academy of
Sciences Georgian SSR], No. 1, 1960.
13. Kakauridze, A. G., "Experimental Device for Automatically Distinguishing
Among a Limited Set of Speech Commands," Elementy Vychislitel'noy
Tekhniki i Mashinnyy Perevod, In-t Elektroniki, Avtomatiki i Teleme-
khaniki AN GruzSSR [Elements of Computer Equipment and Machine
Translation, Institute of Electronics, Automation and Telemechanics,
Academy of Sciences Georgian SSR], Tbilisi, 1964.
14. Caldwell, V., E. Glasser, D. Stewart, "An Analog Model of the Ear,"
Sbornik Problemy Bioniki [Collection of Problems of Bionics], Moscow, 1965.
15. Lebman, Yu. A., V. N. Sobolev, "Analog-Digital Converter for Input of
Speech Signal to Computer Machine," Elektrosvyaz', No. 8, 1963.
16. Myasnikov, L. L., "Objective Recognition of Speech Sounds," Zh. Tekhnich.
Fiz., Vol. 13, No. 3, 1943.
17. Myasnikov, L. L., "The Sounds of Speech and Their Objective Recognition,"
Vestnik Leningr. Universiteta, No. 3, 1946.
18. Myasnikov, L. L., "Physical Investigation of the Sounds of Russian Speech,"
Izvestiya Akad. Nauk SSSR, Seriya Fiz., Vol. 13, No. 6, 1949.
19. Myuler, P., T. Martin, F. Puttsrat, "General Principles of Operation in
Neuron Networks and Their Application to the Recognition of Acoustical
Patterns," Sbornik Problemy Bioniki, Moscow, 1965.
20. Pirogov, A. A., "Theoretical Considerations Concerning a Method for Coding
and Synthesis of Speech Information Using a Harmonic Function,"
Dokl. na Vsesoyuzn. Soveshch. Sektsii Rechi Komissii po Akustike AN SSSR
[Report at All-Union Conference of Speech Section of Commission on
Acoustics, Academy of Sciences USSR], Moscow, 1958.
21. Pirogov, A. A., "Harmonic System for Compression of Speech Spectra,"
Elektrosvyaz', No. 3, 1959.
22. Sapozhkov, M. A., Rechevoy Signal v Kibernetike i Svyazi [The Speech
Signal in Cybernetics and Communications], Moscow, 1963.
23. Solodovnikov, V. V., Statisticheskaya Dinamika Lineynykh Sistem Avtomati-
cheskogo Upravleniya [Statistical Dynamics of Linear Automatic Control
Systems], Moscow, 1960.
24. Fant, G., Akusticheskaya Teoriya Recheobrazovaniya [The Acoustical Theory
of Speech Formation], Moscow, 1964.
25. Kharkevich, A. A., Spektry i Analiz [Spectra and Analysis], Moscow, 1953.
26. Kharkevich, A. A., Ocherki Obshchey Teorii Svyazi [Essays on the General
Theory of Communications], Moscow, 1955.
27. Khrapovitskiy, A. V., "Theoretical and Experimental Investigation of a
Cosine-Logarithmic Vocoder," Dokl. na Vsesoyuzn. Soveshch. Sektsii
Rechi Komissii po Akustike AN SSSR [Report at All-Union Conference of
Speech Section of Commission on Acoustics, Academy of Sciences USSR],
Moscow, 1958.
28. Tsemel', G. I., "Determination of the Invariant Characteristics of
Plosive Sounds on the Basis of Clipped Speech Signals," Izvestiya
Akad. Nauk SSSR, Otdel Tekhnicheskikh Nauk, Energetika i Avtomatika, No. 4,
29. Tsemel', G. I., "Automatic Recognition of Speech Sounds," Zarubezhnaya
Radioelektronika, No. 4, 1962.
30. Tsemel', G. I., "Recognition of a Small Group of Words by Characteristic
Features of the Speech Signal," Sbornik Problemy Peredachi Informatsii,
No. 16, Moscow, 1964.
31. Chistovich, L. A., "Influence of Frequency Limitations on Intelligibility /395
of Russian Consonant Sounds," Dokl. na Vsesoyuzn. Soveshch. Sektsii
Rechi Komissii po Akustike AN SSSR [Report at All-Union Conference of
Speech Section of Commission on Acoustics, Academy of Sciences USSR],
Moscow, 1958.
32. Abstracts of Papers on Speech Analysis, Stockholm Speech Communications
Seminar, 1962, Royal Institute of Technology, Stockholm, Sweden, The
Journal of the Acoustical Society of America (=JASA), Vol. 35, No. 7,
1963.
33. Bennett, W. R., "The Correlatograph. A Machine for Continuous Display of
Short Term Correlation," Bell System Techn. J., Vol. 32, Sept. 1953.
34. Billings, A. R., "Simple Multiplex Vocoder," Electronic and Radio Engr.,
Vol. 36, No. 5, 1959.
35. Billings, A. R., "Communication Efficiency of Vocoders: Comparison of Low-
Power and Conventional Systems," Electronic and Radio Engr., Vol. 36,
No. 12, 1959.
36. Bogert, B. P., "Vobanc - a Two-to-One Speech Band-Width Reduction System,"
JASA, Vol. 28, No. 3, 1956.
37. Borenstein, D. P., "Spectral Characteristics of Digit-Simulating Speech
Sounds," Bell System Techn. J., Vol. 42, No. 6, 1963.
38. Brick, D. B., and G. G. Pick, "Microsecond Word-Recognition System," IEEE
Trans. Electronic Comput., EC-13, No. 1, 1964.
39. Campanella, S. A., "A Survey of Speech Bandwidth Compression Techniques,"
IRE Trans. Audio, AU-6, No. 5, 1958.
40. Campanella, S. I., Coulter, D. C., and P. Engler, "Speech Recognition by
Formant Pattern Matching in N-dimensional Space," JASA, Vol. 36, No. 5,
1964.
41. Campanella, S. I., Coulter, D. C., and R. Irons, "Influence of Transmission
Error on Formant Coded Compressed Speech Signals," J. Audio Engng. Soc.,
Vol. 19, No. 2, 1962.
42. Carre, R., Lancia, R., Paille, J., and R. Gsell, "Etude et realisation
d'un detecteur de melodie pour analyse de la parole," Onde electr.,
Vol. 43, No. 434, 1963.
43. Chang, S. H., "Two Schemes of Speech Compression System," JASA, Vol. 28,
No. 4, 1956.
44. Chang, S. H., Pihl, C. E., and I. Wiren, "The Intervalgram as a Visual
Representation of Speech Sounds," JASA, Vol. 23, No. 6, 1951.
45. Cherry, C., Halle, M., and R. Jacobson, "Toward a Logical Description of
Languages in Their Phonemic Aspects," Language, Vol. 29, No. 1, 1953.
46. Clapper, C. L., "Digital Circuit Techniques for Speech Analysis," IEEE
Trans. Communication and Electronics, CE-11, No. 66, 1963.
47. Cramer, B., "Sprachsynthese zur Ubertragung mit sehr geringer Kanalkapazitat,"
Nachrichtentechn. Z., No. 8, 1964.
48. David, E. E., and H. S. McDonald, "Note on Pitch-Synchronous Processing of
Speech," JASA, Vol. 28, No. 6, 1956.
49. David, J., Schroeder, M. K., Logan, B. F., and H. J. Prestigiacomo,
"Voice-Excited Vocoders for Practical Speech Bandwidth Reduction," IRE
Trans. Inform. Theory, IT-8, No. 5, 1962.
50. Davis, K. H., Bidulph, R., and S. Balashek, "Automatic Recognition of
Spoken Digits," JASA, Vol. 24, No. 6, 1952.
51. Denes, P., "The Design and Operation of the Mechanical Speech Recognizer
at University College London," J. Brit. IRE, Vol. 19, No. 4, 1959.
52. Denes, P., and M. V. Mathews, "Spoken Digit Recognition Using Time-Frequency
Pattern Matching," JASA, Vol. 32, No. 11, 1960.
53. Deuber, G., "Varied Approaches Used to Develop System for Reliable
Recognition of Voice Commands," Electronic News, Vol. 8, No. 383, 1963.
54. Deweze, A., "Techniques de reconnaissance automatique des formes visuelles
et sonores," Automatisme, Vol. 9, No. 3, 1964.
55. Dersch, W. C., "Speech Operated Safety Switch," Electronics, Vol. 36, No.
56. Dolansky, L. O., "Instantaneous Pitch-Period Indicator," JASA, Vol. 27, No.
57. Dreyfus-Graf, J., "Phonetographe et Subformants," Bull. Techn. PTT, Bern,
No. 2, 1957.
58. Dreyfus-Graf, J., "Phonetographe: Present et Future," Bull. Techn. PTT, Bern,
No. 5, 1961.
59. Dudley, H. W., "Remaking Speech," JASA, Vol. 11, No. 2, 1939.
60. Dudley, H. W., "Speech Analysis and Synthesis System," JASA, Vol. 22, No.
61. Dudley, H. W., "Phonetic Pattern Recognition for Narrowband Transmission,"
JASA, Vol. 30, No. 8, 1958.
62. Dunn, H. K., "Methods of Measuring Vowel Formant Bandwidths," JASA, Vol. 33,
No. 12, 1961.
63. Fano, R. M., "The Information Theory Point of View in Speech Communication,"
JASA, Vol. 22, No. 10, 1950.
64. Fant, G., "Acoustic Theory of Speech Production with Calculations Based
on X-Ray Studies of Russian Articulations," Mouton & Co, s'Gravenhage,
1960.
65. Fant, G., Fintoft, K., Liliencrants, J., Lindblom, B., and J. Martony, /396
"Formant-Amplitude Measurements," JASA, Vol. 35, No. 11, 1963.
66. Flanagan, J. L., "A Difference Limen for Vowel Formant Frequency," JASA,
Vol. 27, No. 3, 1955.
67. Flanagan, J. L., "Automatic Extraction of Formant Frequencies from
Continuous Speech," JASA, Vol. 28, No. 1, 1956.
68. Flanagan, J. L., "Band-Width and Channel Capacity Necessary to Transmit
the Formant Information of Speech," JASA, Vol. 28, No. 4, 1956.
69. Flanagan, J. L., "A Resonance-Vocoder and Baseband Complement: A Hybrid
System for Speech Transmission," IRE Trans. Audio, AU-8, May-June 1960.
70. Forgie, J. W., and C. W. Hughes, "A Real-Time Speech Input System for a
Digital Computer," JASA, Vol. 30, No. 7, 1958.
71. Forgie, J. W., and C. D. Forgie, "Results Obtained from a Vowel Recognition
Computer Programme," JASA, Vol. 31, No. 9, 1959.
72. Fujisaki, H., "Automatic Extraction of Fundamental Period of Speech by
Auto-Correlation Analysis and Peak Detection," JASA, Vol. 32, No. 11,
1960.
73. Gabor, D., "New Possibilities in Speech Transmission," J. IEE, Vol. 94,
1947.
74. Caldwell, W. F., "Recognition of Sounds by Cochlear Patterns," IEEE Trans.
Military Electronics, MIL-7, No. 2-3, 1963.
75. Gerstman, L. J., Liberman, A. M., Delattre, P. C., and F. S. Cooper, "Rate
and Duration of Change in Formant Frequency as Cues for Identification
of Speech Sounds," JASA, Vol. 26, No. 9, 1954.
76. Gold, B., and C. Rader, "Bandpass Compressor: A New Type of Speech-
Compression Device," JASA, Vol. 36, No. 6, 1964.
77. Golden, R. M., MacLean, D. I., and A. I. Prestigiacomo, "Frequency Multiplex
System for a 10-Spectrum-Channel Voice-Excited Vocoder," JASA, Vol. 36,
No. 10, 1964.
78. Gribenski, A., "The Pitch of Sound, Its Measuring and Perception," Nature,
Vol. 173, No. 4, 1957.
79. Haggard, M. P., "In Defense of the Formant," Phonetica, Vol. 10, No. 3-4,
1963.
80. Harris, C. M., and W. M. Waite, "Gaussian-Filter Spectrum Analyzer," JASA,
Vol. 35, No. 4, 1963.
81. Harris, C. M., and M. R. Weiss, "Pitch Extraction by Computer Processing of
High-Resolution Fourier Analysis Data," JASA, Vol. 35, No. 3, 1963.
82. Hellwarth, G. A., "Speech-Formant Measurement with Continuously Tuned
Automatic Tracking Filters," JASA, Vol. 35, No. 5, 1963.
83. Heydeman, P., "Ein Modellversuch zum Frequenzunterscheidungsvermogen des
Ohres," Acustica, Vol. 13, No. 2, 1963.
84. Hillix, W., "Use of Two Nonacoustic Measures in Computer Recognition of
Spoken Digits," JASA, Vol. 35, No. 12, 1963.
85. Howard, C. R., "Speech Analysis Synthesis Scheme Using Continuous Parameter,"
JASA, Vol. 28, No. 6, 1956.
86. Howard, C. R., Chang, S. H., and M. J. Carrabes, "Analysis and Synthesis of
Formants and Moments of Speech Spectra," JASA, Vol. 28, No. 4, 1956.
87. Huggins, W. H., "A Phase Principle for Complex-Frequency Analysis and its
Implications in Auditory Theory," JASA, Vol. 24, No. 6, 1952.
88. Hughes, G. W., "Identification of Speech Sounds by Means of a Digital
Computer," JASA, Vol. 31, No. 1, 1959.
89. Husson, R., "Zur Spektralstruktur menschlicher Vokale aller Stimmstarken,"
Phonetica, No. 1-2, 1963.
90. Inomata, S., "An Auditory Pattern Processing Model," IEEE Trans. Inform.
Theory, IT-9, No. 4, 1963.
91. Inomata, S., "Speech Recognition and Generation by a Digital Computer,"
Res. Electrotechn. Labs, No. 645, 1963.
92. Ithell, A. H., "A Determination of the Acoustical Input Impedance Character-
istics of Human Ears," Acustica, Vol. 13, No. 4, 1963.
93. Jaffe, J., Cassotta, L., and S. Feldstein, "Markovian Model of Time
Patterns of Speech," Science, Vol. 144, No. 3260, 1964.
94. Jakobson, R., Fant, G. G., and M. Halle, "Preliminaries to Speech Analysis,
The Distinctive Features and Their Correlates," Massachusetts Institute
of Technology, Acoust. Lab. Techn. Rept. No. 13, 1952.
95. Jakobson, R., "Die Verteilung der stimmhaften und stimmlosen Gerauschlaute
im Russischen, Festschrift fur Max Vasmer," Berlin, 1956.
96. Johnson, W., "System to Generate Speech from Written Pattern Shown,"
Electronic News, Vol. 8, No. 392, 1963.
97. Kacprowski, J., "Speech Compression by Means of Analysis-Synthesis Methods,"
Polish Academy of Sciences, Proc. Vibration Probl., Vol. 5, No. 3, 1964.
98. Kadokawa, J., and K. Nakata, "Formant Frequency Extraction by Analysis-by- /397
Synthesis Technique," J. Radio Res. Labs, Vol. 10, No. 49, 1963.
99. Kadokawa, J., and K. Nakata, "Analysis of Speech by Vocal Tract Configu-
ration," J. Radio Res. Labs, Vol. 11, No. 54, 1964.
100. Kaliaian, M. V., "Phonetic Typewriter of Speech," JASA, Vol. 36, No. 6,
1964.
101. Karplus, H. B., "Correlation Hypothesis to Explain the Fine Frequency Dis-
crimination of the Ear," JASA, Vol. 35, No. 5, 1963.
102. Klass, P. I., "Vocoder Increases Channels Security," Aviat. Week, Vol. 73,
No. 4, 1960.
103. Klumpp, R. G., and I. C. Webster, "Intelligibility of Time-Compressed
Speech," JASA, Vol. 33, No. 3, 1961.
104. Koenig, W., "A New Frequency Scale for Acoustic Measurements," Bell Labs.
Rec., Vol. 27, No. 7, 1949.
105. Ladefoged, P., "Acoustic Correlate of Subglottal Activity," JASA, Vol. 35,
No. 5, 1963.
106. Ladefoged, P., and N. P. McKinney, "Loudness, Sound Pressure and Subglottal
Pressure in Speech," JASA, Vol. 35, No. 4, 1963.
107. Latil de, P., "La parole est aux calculateurs," Electronique Industr.,
No. 76, 1964.
108. Lehiste, I., and G. E. Peterson, "Some Basic Considerations in the Analysis
of Intonation," JASA, Vol. 33, No. 4, 1961.
109. Lieberman, P., "Perturbations in Vocal Pitch," JASA, Vol. 33, No. 5, 1961.
110. Liberman, A. M., Ingeman, F., Lisker, L., Delattre, P. C., and F. S.
Cooper, "Minimal Rules for Synthesizing Speech," JASA, Vol. 31, No. 11,
1959.
111. Licklider, I. C., "Effects of Amplitude Distortion Upon the Intelligibility
of Speech," JASA, Vol. 18, No. 2, 1946.
112. Licklider, I. C., "The Intelligibility of Amplitude-Dichotomized, Time-
Quantized Speech Waves," JASA, Vol. 22, No. 6, 1950.
113. Licklider, I. C., "Influence of Phase Coherence Upon the Pitch of Complex
Periodic Sounds," JASA, Vol. 27, No. 5, 1955.
114. Licklider, I. C., "Man-Computer Symbiosis," IRE Trans., HFE-1, No. 1, 1960.
115. "Machines Controlled by Spoken Commands," Datamation, Vol. 8, No. 6, 1962.
116. Martin, T. B., and I. I. Talvagel, "Application of Neural Logic to Speech
Analysis and Recognition," IEEE Trans. Military Electronics, MIL-7,
No. 2-3, 1963.
117. Miller, A. E., and M. V. Mathews, "Investigation of the Glottal Wave-
shape by Automatic Inverse Filtering," JASA, Vol. 35, No. 11, 1963.
118. Miller, R. L., "Improvements in the Vocoder," JASA, Vol. 25, No. 4, 1953.
119. Nicolau, E., Weber, L., and S. Gavat, "Aparate pentru recunoasterea
automata a vocalelor," Automatica si electronica, Vol. 7, No. 6, 1963.
120. Noll, A. M., "Short-Time Spectrum and 'Cepstrum' Techniques for Vocal-
Pitch Detection," JASA, Vol. 36, No. 2, 1964.
121. Oeren, F. W., "Some Critical Observations on the Formant Theory of Vowel
Recognition," Phonetica, Vol. 10, No. 1-2, 1963.
122. Olson, H. F., and H. Belar, "Phonetic Typewriter," JASA, Vol. 28, No. 6,
1956.
123. Olson, H. F., and H. Belar, "Phonetic Typewriter III," JASA, Vol. 33,
No. 11, 1961.
124. Olson, H. F., and H. Belar, "Syllable Analyzer, Coder and Synthesizer for
the Transmission of Speech," IRE Trans. Audio, AU-10, No. 11, 1962.
125. Olson, H. F., Belar, H., and R. Sobrino, "Demonstration of a Speech
Processing System Consisting of a Speech Analyzer, Translator, Typer
and Synthesizer," JASA, Vol. 34, No. 10, 1962.
126. Olson, H. F., and H. Belar, "Performance of a Code-Operated Speech
Synthesizer," JASA, Vol. 38, No. 5, 1964.
127. Paul, A. P., House, A. S., and K. N. Stevens, "Automatic Reduction of Vowel
Spectra: An Analysis-by-Synthesis Method and its Evaluation," JASA,
Vol. 36, No. 2, 1964.
128. Peterson, G. E., "Design of Visible Speech Devices," JASA, Vol. 26, No. 3,
1954.
129. Peterson, E., and F. S. Cooper, "Peakpicker: A Band-Width Compression
Device," JASA, Vol. 29, No. 6, 1957.
130. Peterson, G. E., and I. Lehiste, "Identification of Filtered Vowels," JASA,
Vol. 31, No. 6, 1959.
131. Peterson, G. E., Sivertsen, E., and D. L. Subrahmanyam, "Intelligibility
of Diphasic Speech," JASA, Vol. 28, No. 3, 1956.
132. Petrick, S. R., "Talking to a Computer," New Scientist, No. 235, 1961.
133. Phyie, D. L., and I. E. Toffler, "Some Features of the Army Channel
Vocoder," JASA, Vol. 35, No. 12, 1963.
134. Pick, G. G., Gray, C. B., and D. B. Brick, "The Solenoid Array - a New
Computer Element," IEEE Trans. Electronic Computers, EC-13, No. 1, 1964.
135. Pinson, E. N., "Pitch-Synchronous Time-Domain Estimation of Formant Fre-
quencies and Bandwidth," JASA, Vol. 35, No. 8, 1963.
136. Pollack, I., and L. Picket, "Effect of Noise and Filtering on Speech
Intelligibility at High Levels," JASA, Vol. 29, No. 12, 1957.
137. Potter, R. K., Kopp, G. A., and H. C. Green, "Visible Speech," Van /398
Nostrand, New York, 1947.
138. Potter, R. K., and J. C. Steinberg, "Toward the Specification of Speech,"
JASA, Vol. 22, No. 6, 1950.
139. Proceedings of the Speech Communication Seminar, Stockholm, August 29
to September 1, 1962, Publ. Speech Transmission Lab., Royal Inst.
Technology, Stockholm, 1964.
140. Pruzansky, S., and P. D. Briener, "Automatic Talker Recognition Using
Time-Frequency Pattern Matching," JASA, Vol. 33, No. 6, 1961.
141. Pun, L., "The Phonetograph," Control, Vol. 7, January 1963.
142. Rader, C., "Spectra of Vocoder-Channel Signals," JASA, Vol. 35, No. 5,
1963.
143. Rader, C. M., "Vector Pitch Detection," JASA, Vol. 36, No. 10, 1964.
144. Rawley, I. R., "Converting Digital Data to Voice," Electronic Industr.,
Vol. 23, No. 4, 1964.
145. Rais, A., "Vowel Recognition in Clipped Speech," Nature, Vol. 181, No. 3,
1958.
146. Righini, G. U., "A Pitch Extractor of the Voice," Acustica, Vol. 13,
No. 4, 1963.
147. Rosen, G., "Dynamic Analog Speech Synthesizer," JASA, Vol. 30, No. 3,
1958.
148. Sackschewsky, V. E., and H. L. Oestreicher, "Pattern Recognition as a
Problem in Decision Theory and an Application to Speech Recognition,"
IEEE Trans. Military Electronics, MIL-7, No. 2-3, 1963.
149. Sakai, T., Doshita, S., Nagata, K. I., "Phonetic Typewriter," JASA, Vol.
35, No. 7, 1963.
150. Sakai, T., and S. I. Inoue, "New Instruments and Methods for Speech
Analysis," JASA, Vol. 32, No. 4, 1960.
151. Schief, R., "Koinzidenz-Filter als Modell fur das menschliche Tonhohenun-
terscheidungsvermogen," Kybernetik, Vol. 2, H. 1, 1963.
152. Scholtz, P. N., and R. Bakis, "Spoken Digit Recognition Using Vowel-
Consonant Segmentation," JASA, Vol. 34, No. 1, 1962.
153. Schroeder, M. R., "New Approach to Time Domain Analysis and Synthesis,"
JASA, Vol. 31, No. 6, 1959.
154. Schroeder, M. R., and T. H. Crystal, "Auto-Correlation Vocoder," JASA,
Vol. 32, No. 7, 1960.
155. Schroeder, M. R., and E. E. David, "A Vocoder for Transmitting 10 kc/s
Speech Over a 3.5 kc/s Channel," Acustica, Vol. 10, No. 1, 1960.
156. Seki, H., "A New Method of Speech Transmission by Frequency Demultipli-
cation and Multiplication," J. Acoust. Soc. Japan, 14 June 1958.
157. Siedler, G., "Untersuchungen uber die Bedeutung bestimmter Tonfrequenz-
bander fur die Verstandlichkeit synthetischer Sprache und uber Anderung
der Sprachverstandlichkeit bei Kanalvertauschungen," Z. angew. Phys.,
Vol. 13, Nr. 6, 1961.
158. Simmons, P. L., "Automation of Speech, Speech Synthesis and Synthetic
Speech. A Bibliographic Survey from 1950-1960," IRE Trans. Audio, AU-9,
November-December 1961.
159. Slaymaker, F. H., "Bandwidth Compression by Means of Vocoder," IRE Trans.
Audio, AU-9, January-February 1960.
160. Smith, C. R., "A Phoneme Detector," JASA, Vol. 23, No. 4, 1951.
161. Smith, C. R., "The Analysis and Automatic Recognition of Speech Sounds,"
Electronic Engng., Vol. 24, No. 8, 1952.
162. Smith, S. L., "Man-Computer Information Transfer," Electro-Technology,
Vol. 72, August 1963.
163. "Speech Recognition Gets Push From Synthesizer," Electronics, Vol. 34, No.
164. Steele, K. W., and L. E. Cassel, "Quality Improvement in the Channel Vo-
coder," JASA, Vol. 35, No. 5, 1963.
165. Stevens, K. N., "Acoustical Analysis of Speech," JASA, Vol. 30, No. 7,
1958.
166. Sund, H., "A Sound Spectrometer for Speech Analysis," Trans. Royal Inst.
Technology, Stockholm, No. 112, 1957.
167. Suzuki, I., Kadokawa, J., Nakata, K., "Formant-Frequency Extraction by
the Method of Moment Calculations," JASA, Vol. 35, No. 9, 1963.
168. Suzuki, I., and K. Nakata, "Phonemic Classification and Recognition of
Japanese Monosyllables," J. Radio Res. Labs., Vol. 10, No. 49, 1963.
169. Suzuki, I., Nakata, K., and K. Maezono, "Speech Data Analyzed by
Computer Program, Classification and Recognition of Japanese Mono-
syllables," IEEE Trans. Military Electronics, MIL-7, No. 2-3, 1963.
170. Tarnoczy, T. H., "Vowel Formant Bandwidths and Synthetic Vowels," JASA,
Vol. 34, No. 6, 1962.
171. Tierney, J., Gold, B., Sferrino, V., Dumanian, I. A., and E. Aho, "Channel
Vocoder with Digital Pitch Extractor," JASA, Vol. 36, No. 10, 1964.
172. Toffler, I. E., and F. B. Wade, "Pitch Extractor Using Clippers,"
JASA, Vol. 36, No. 5, 1964.
173. Uhr, L., "Recognition of Letters, Pictures and Speech by a Discovery
and Learning Program," WESCON Techn. Papers, Vol. 8, p. 4, August 25-
174. Vilbig, F., "An Analysis of Clipped Speech," JASA, Vol. 27, No. 1, 1955. /399
175. Vilbig, F., "Speech Compression," JASA, Vol. 28, No. 1, 1956.
176. Vilbig, F., "Improvement of Simplification of the Scanvocoder and its
Connection to a Correlation Pulse Code System," JASA, Vol. 23, No. 4,
1951.
177. Vilbig, F., and K. H. Haase, "Uber einige Systeme zur Sprachbandkompression,"
Nachrichtentechn. Fachber. (NTF), Vol. 3, 1956.
178. Vilbig, F., and K. H. Haase, "Some Systems for Speech-Band Compression,"
JASA, Vol. 28, No. 4, 1956.
179. Wallace, J. C., "Comparative Evaluation of Pitch-Signal Indicators," JASA,
Vol. 35, No. 5, 1963.
180. Waller, R., "Self-Programming Pattern Recognizer," Measurements and Control,
No. 3, 1964.
181. Waller, R., "Sceptron," J. Scient. Instruments, Vol. 41, No. 5, 1964.
182. Weiss, M. R., and C. M. Harris, "Computer Technique for High-Speed
Extraction of Speech Parameters," JASA, Vol. 35, No. 2, 1963.
183. Widrow, B., Groner, G. F., Hu, M. I. C., Smith, F. W., Specht, D. F., and
L. R. Talbert, "Practical Applications for Adaptive Data-Processing
Systems," WESCON Techn. Papers, Vol. 7, No. 7, 1963.
184. Winckel, F., "Tonhohenextractor fur Sprache mit Gleichstromanzeige,"
Phonetica, Nr. 3-4, 1964.
185. Wiren, I., and H. L. Stubbs, "Electronic Binary Selection System for
Phoneme Classification," JASA, Vol. 28, No. 6, 1956.
186. Wood, D. E., and T. L. Hewitt, "New Instrumentation for Making Spectro-
graphic Pictures of Speech," JASA, Vol. 35, No. 8, 1963.
187. Wood, D. E., "New Display Format and a Flexible-Time Integration for
Spectral-Analysis Instrumentation," JASA, Vol. 36, No. 4, 1964.
188. W. S., Dr., "Identification des personnes par la spectrographie vocale,"
Automatisme, Vol. 8, No. 7-8, 1963.
189. Yoshima, T., "Japan Firm Builds Spoken Voice Digit Recognizer," Electronic
News, Vol. 8, No. 390, 1963.
190. Zwicker, E., "Moglichkeiten zur Spracherkennung uber den Tastsinn mit
Hilfe eines Funktionsmodells des Gehors," Elektronische Rechenanlagen,
Vol. 6, H. 6, 1964.